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Abstract 

In sparse regression modeling via regularization such as the lasso, it is important 
to select appropriate values of tuning parameters including regularization param- 
eters. The choice of tuning parameters can be viewed as a model selection and 
evaluation problem. Mallows' Cp type criteria may be used as a tuning param- 
eter selection tool in lasso-type regularization methods, for which the concept of 
degrees of freedom plays a key role. In the present paper, we propose an efficient 
algorithm that computes the degrees of freedom by extending the generalized path 
seeking algorithm. Our procedure allows us to construct model selection criteria 
for evaluating models estimated by regularization with a wide variety of convex and 
non-convex penalties. Monte Carlo simulations demonstrate that our methodology 
performs well in various situations. A real data example is also given to illustrate 
our procedure. 

Key Words: Cp, Degrees of freedom, Generalized path seeking, Model selection, Regu- 
larization, Sparse regression. Variable selection 



1 Introduction 



Variable selection is fundamentally important in high- dimensional linear regression mod- 
eling. Traditional variable selection procedures follow the best sub set sele c tion a long with 
model selection criteria such a s Akaike 's info rmation criterion (jAkaikd . 119731 ) and the 



Bayesian information criterion (jSchwarzl . 119781 ). However, t he be st subset selection is of- 
ten unstable because of its inherent discreteness ( iBreimanl . 119961 ). and then the resulting 
model has poor p rediction accuracy. To overcome this drawback of the subset selection, 
Tibshiranil ( 119961 ) proposed the lasso, which shrinks some coefficients toward exactly zero 
by imposing an Li penalty on regression coefficients, resulting in simultaneous model 
selection and estimation procedure. 
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Over the past 15 years, there has been a considerable amount of la sso- t ype p enahzation 



methods in literature: brid ge regression (IFrank and Friedmaru. 



1993 



Ful.ll998h. sm oothly 



clippe d absolute deviatiori (IFan and Lil . 120011) . elastic net ( IZou and Hastid . 120051 ) , group 
lass o (jYuan and Linl. 120061 ). adaptive lasso (IZoul . 120061). compo site absolute penalties fam- 
i ly (Zhao et al.j2009 ). minimax concave penalty (Zhang, 2010) and generalized elastic net 
([Friedman . 2008 ) along with many other regularization techniques. It is well known that 
the solutions are not usually expressed in a closed form, since the penalty term includes 
non-differentiable function. A number of researchers have presented efficient algorithms 
to obtain the entir e solutions (e.g. , least angle regression, 



descent algorithm, 
seeking. 



Friedman et al. 



Friedman . 



200 



20071 . 



2010 



Efron et al 



Mazumder et al. 



2011 



20041 coordinate 



; generalized path 



A crucial issue in the sparse regression modeling via regularization is the selection of 
adjusted tuning parameters including regularization parameters, because the regulariza- 
tion parameters identify a set of non-zero coefficients and then assign a set of variables 
to be included in a model. Choosing the tuning parameter s can be viewed as a model 
selection and evaluation problem. Mallows' Cp type criteria (IMallowsl . Il973[ ) estimate the 
prediction error of the fitted model, and give better accuracy than cross validation in 

20041 ). The concept of degrees of freedom (e.g.. Ye . 



some situations (lEfron 



1998 



1986 



Efron ■ 



20041 ) plays a key role in the theory of Cp type criteria. 



In a practical situation, however, it is difficult to directly derive an analytical expres- 
sion of (unbiased estimator of) degrees of freedom for sparse regression modeling. A few 
rese archer s have derived the analytical result s by using t he St ein's unbiased risk estima- 



tor (ISteinl . Il98ll ) for only specific penalties. 



Zou et al. 



(120071 ) showed that the number 



of non-zero coefficients is an unbiased estimate of the degrees of freedom of the lasso. 



Katd (120091 ) derived an unbiased estimate of the degrees fre edom of the lasso, group lasso 

J2OI1I ) pro- 



Mazumder et al. 



and fused lasso based on a differential geometric approach, 
posed a re-parametrization of minimax concave penalty, which enables us to calibrate 
the degrees of freedom of minimax concave family. However, these selection procedures 
do not cover more general regularization methods via convex an d non-con v ex penalties . 
In such a situation , the cross validat ion and the bootstrap (e.g. 



Shen and Ye . 



2002; 



Shen et al. 



200^ may be useful to estimate the degrees of freedom. 



Ye, 



1998 



Efron ■ 



2004 



These approaches, however, can be computationally expensive, and often yield unstable 
estimates. 

In the present paper, we propose a new algorithm that can iterativel y calculate the de - 
grees of freedom by extending the generalized path seeking algorithm (IFriedmanl . 120081 ) . 



The proposed procedure can be applied to a wide variety o: 



penalties including the generalized elastic net family ( iFriedman 



convex and non-convex 



20081 ). Furthermore, our 



algorithm is computationally-efficient, because there is no need to perform numerical op- 
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timization to obtain the solutions and degrees of freedom at each step. The proposed 
methodology is investigated through the analysis of real data and Monte Carlo simula- 
tions. Numerical results show that Cp criterion based on our algorithm performs well in 
various situations. 

The remainder of this paper is organized as follows: Section 2 briefly describes the 
degrees of freedom in linear regression models. In Section 3, we introduce a new algo- 
rithm that iteratively computes the degrees of freedom by extending the generalized path 
seeking. Section 4 presents numerical results for both artificial and real datasets. Some 
concluding remarks are given in Section 5. 

2 Degrees of freedom in linear regression models 

In linear regression models, the degrees of freedom can be used as a model complexity 
measure in Mallows' Cp type criteria. Suppose that Xj = {xy, . . . ,XNj)'^ {j = 1, • • • ,p) 
are predictors and y = (j/i, . . . , i/n)'^ is a response vector. Without loss of generality, it is 
assumed that the response is centered and the predictors are standardized by changing a 
location and employing scale transformations 

N N N 

X]2/i = 0, ^Xij = 0, ^xl = l (j = l,...,p). 

i=l 1=1 i=l 

Consider the linear regression model 

where X = {xi, . . . , Xp) is an N x p predictor matrix, (3 = . . . , /3p)^ is a coefficient 
vector and e = (ei, . . . , En)'^ is an error vector with E[£] = and V[e] = a^I. Here I is 
an identity matrix. 

The linear regression model is estimated by the penalized least square method 

/3(t) = argmin R{(3) s.t. P(/3) < t, (1) 

where R{(3) is a squared error loss function 

R{(3) = {y-XI3f{y-X(3)/N, (2) 

P(/3) is a penalty term which yields sparse solutions (e.g., the lasso penalty is P(/3) = 
"Y^- and t is a tuning parameter. An equivalent formulation of ([T]) is 

/3(A) = argmin{P(/3) + AP(/3)}, 
where A is a regularization parameter, which corresponds to t in ([T]). 
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We consider the problem of selecting an appropriate value of tuning parameter t (or A) 
by us ing Cp type criteria, for which the concept of degrees of freedom plays a key role (jYe , 
19981 ). Assume that the expectation and the variance-covariance matrix of the response 
vector y are 

E[y] = fi, V{y) = E[{y-fx){y-tif]=T'l, (3) 

where is a true mean vector and is a true variance. Given a modeling procedure m, 
the estimate /i = m{y) can be produced from the d .ata vector y. Then, the degrees of 
freedom of the fitting procedure m is defined as (jYe 

df = f;^^%^, (4) 

i=l 



1998 



Efronl . Il98d l2004h 



where fii is the ith element of fi. For example, when the estimator fi is expressed as 
a linear combination of response vector, i.e. fi = Hy with H being independent of y, 
the degrees of freedom is tr(ff). The trace o f mat rix H is referred to as an effective 
number of parameters (IHastie and Tibshiranil . Il990l l. which is widely used to select the 
tuning parameter in ridge-type regression. In sparse regression modeling such as the 
lasso, however, it is difficult to derive the degrees of freedom, since the penalty term is 
not differentiable at /3j = (j = 1, . . . ,p) so that the solutions are not usually expressed 
in a closed form. 

Mallows' Cp criterion, which is an unbiased estimator of the true prediction error, can 
be constructed with the degrees of freedom defined in (jll). Assume that the response 
vector y is generated according to (j3]), and the true expectation n is estimated by linear 
regression model. As a criterion to measur e the effectiveness of the model, we consider 



Hastie et al. 



the expected error (e.g. 

Err = EyEyncv [(A " 
where the expectation Eyncw is taken over y 



20081 ) defined by 

(^t, r^J) independent of y. 



(5) 



Lemma 2.1. The expected error in Ij^ can be expressed as 



Err = Ey\\\y 



Proof. The proof is in Appendix. 

Lemma [2711 suggests Cp criterion (e.g.. 



(6) 
□ 



Efron 



200J) 



a, 



\y - /^l 



2rMf, 



which is an unbiased estimator of the expected error in (j5]). The optimal model is selected 
by minimizing Cp. As an estimator of the true variance of r^, the unbiased estimator of 
error variance of the most complex model is usually used. 
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Table 1: Summary of model selection criteria based on the degrees of freedom. 



Criterion Formula 





lly-Af + 2r2df 




AIC 


iVlog(27rr2)+ "^"^^1 
iVlog(2J"^/l" )- 


2 

+ 2df 


AICc 


2Ardf 
^'^-iV-df-l 


BIG 


iVlog(2vrr2)+ "^"/^ 

1 \\y-(^f 

N{1- di/NY 


2 

' +logA^df 


GCV 





The degrees of fr eedom can l ead t o several model selection criteria, which are sum- 



marized in Table [TJ 



ZouetaL 



Akaike . 



( 120071 ) introduced Akaike' s inform a tion c rit erion (AIC; 



1973r) and Bayesian information criterion (BIC; 



Schwarz 



1978h . IWang et al. 



(120071 . 120091 ) showed that the Bayesian information criterion holds the consistency in model 



selection. We also in t roduc e bias corrected Akaike's information crite rion fAICr': pugiura 



1978 



Hurvich et al. 



19981 ) and generalized cross validation (GCV; 
19791 ). These two criteria do not need the true variance r^. 



Craven and Wahba 



3 Efficient algorithm for computing the degrees of 
freedom 

In this section, first, the generalized path seeking algorithm is briefly described. Then, 
a new algorithm that iteratively computes the degrees of freedom is introduced. Fur- 
thermore, we modify the algorithm to ease the computational burden for large sample 
sizes. 



3.1 Generalized path seeking algorithm 



Friedman! (120081 ) proposed the generalized path seeking, which is a fast algorithm to solve 



the problem ([T]). The generalized path seeking can produce the entire solutions that closely 
approximate those for a wide variety of convex and non-convex constraints. Suppose that 
the penalty term P{(3) satisfies following condition: 

I m 



>0|j 



(7) 



This condition defines a class of penalties where each member in the class is a mono- 
tone increasing function of absolute value of each of its arguments. For example, the 
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lasso penalty P{f3) = y^^_;^ \ 0A is inc 
Simil arly, e 
lasso (IZoul . 



astic net ( IZou and Hastie 



uded in this class, because dP((3 ) /d\(3i \ = 1 > 0. 
2005 ). group lass o (Yuan and Linl. 2006 ). adapt 



ive 



20091) . minimax con- 



20061). composi te absolute penalties family (IZhao et a 
cave penalty ( jZhangj . I2OI0I ) and generalized elastic net (jpriedmanl . 120081 ) with many other 
convex and non-convex penalties are included in this class. 

Denote f3{t) is the solution at tuning parameter t. The generalized path seeking 
algorithm starts at t = with /3(0) = 0. The solution can be iteratively computed: for 
given /3(t), the solution /3(t + At) can be produced, where At is a small positive value. 
Suppose the path 0{t) is a continuous function of t and all coefficient paths {Pj{t) \ j = 
1, . . . ,p} are monotone function of t, that is, + At)| > | j = 1, . . . For 

each step, one element of coefficient vector say (3k{t), is incriminated in a correct 

direction Xk{t) with all other coefficients remaining unchanged, i.e. 



kit + At) 
{^{t + At) 



k{t) + At-\k{t), 



(9) 



where k and Xk{t) are defined as 



argmax \gj{t)\/pj{t), 
ie{i,...,p} 



Afc(i) = 9k{t)/Pk{t)- 



(10) 



Here gj{t) and Pj{t) are 



9 At) 

Pj{t) 



dR{(3) 
dPjfB) 



The derivation of the generalized path seeking algorithm is in Appendix. 

Remark 3.1. We assumed that f3{t) is continuous and each element is monotone func- 
tion of t. Although these conditions can b e satisfied i n mo st cases, sometimes f3{t) is 
discontinuous or non-monotone function. Wriedman 1200a) proposed an approach for 
non-monotone case, which is as follows: first, we define a set S = {j \ \j{t) ■ (3j{t) < 0}. 
When S is not empty, the index k is selected by k = aYgma.Xj^g\Xj{t)\. Otherwise, 
= argmaXje|i p||Aj(t)|. The detailed description of discontinuous case is also given in 
Friedman ^200a)- 



When Pj{t) = 1 (i.e. the lasso penalty), (IHl) yields 



k{t + At)=f3k{t) + At-gk{t). 
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Note that the updated coefficient in (jS]) and flTTl) moves in the same direction even if the 
lasso penalty is not applied, because the condition in ([7]) yields sign{gk{t)) = sign{\k{t)). 
This means that the update equations ([8]) and (ITT]) produce the same solution path when 
At — )■ unless Pj(t) or l/pj{t) diverges. Therefore, we can use the update equation in 
(ITTi) instead of ([H]). If the update equation in (ITTi) is applied, an iterative algorithm that 
computes the degrees of freedom in (j4]) can be derived. 
From ([9]) and ffTTj) , the predicted value at t + At is 

flit + At) = fi{t) + At ■ gkit)xk. (12) 

Because the loss function R{(3) is squared loss as ([2]), we have gk{t) = 2x^{y — fi{t))/N . 
Thus, fi{t + At) is 

(i{t + At) = fi{t) + ^AtXkxl-{y-fi{t)). (13) 

Example 3.1. Let X he orthogonal, i.e. X'^X = I. By substituting fT^) into gj{t) = 
2xJ{y — fi{t))/N , the update equation of g jit) is 

Because of the orthogonality, we have 

g,{t) = il-2At/N)'^ ■2xJy/N, 

where tj is the number of times that jth coefficient is updated until time step t. It is shown 
that the absolute value of gj{t) is monotone non-increasing function and gj{t) — > when 
tj CO. When t — )• oo, the least squared estimates can be obtained because gj(t) — )■ for 
all i = l,...,p. 

3.2 Derivation of update equation of degrees of freedom 

Equation f[T^ suggests the update equation of the covariance matrix in (jl]) as follows: 

(14) 



cov(pt(t + At), j/) _ cov(/i(t)^ 2 cov(/x(t),?/) 
" +-AtXkX,Y - 



The degrees of freedom is iteratively calculated by taking the trace of ( iHj) . The initial 
value of cov(/i(t), 2/)/r2 is set to zero-matrix O because of the following equation: 

cov(/t(0),y) cov(0, ?/) 



r2 r2 



O. 



Let M(t) = COY {fi(t),y)/T'^ and k{t) be the index of updated element of coefficient 
vector at time step t. The update equation of the degrees of freedom in f|T^ can be 
expressed as 

/ - M(t + At) = (/ - axk^t)xl^t)){l - M{t)), (15) 
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where a = 2At/N. Then, the covariance matrix can be updated by 

M{t) =1 -{I - aXk{t-i)xl^t-i)W - (^Xk{t-2)X^t-2)) ■••(/- «a;fc(i)a;J(i)). (16) 

Example 3.2. The degrees of freedom can he easily derived whenX is orthogonal. Because 
of the orthogonality, the covariance matrix in [T^) can be calculated as 

M{t) = I -{I - axix^y^ (/ - ax2X^Y^ ■•■(/- aXpx'^Y^ 
p 

= Y.{l-{l-aY^}x,x], 
where tj is defined in Example \3.1i Then, the degrees of freedom is 

tr{M(t)} = f^{l-(l-a)*^}. 

i=i 

When t is very small, the degrees of freedom is close to since a = 2/S.t/N is sufficiently 
small. As t gets larger, the degrees of freedom increases since (1 — a)*-' > (1 — q;)*j+^. 
When tj —7- oo for all j , the degrees of freedom becomes the number of parameters, which 
coincides with the degrees of freedom of least squared estimates. 



3.3 Modification of the update equation 

The update equation in f|T2!) causes httle change in predicted values from t to t + At near 
the least squared estimates, because \gk(t)\ = \2x1{y — fi(t))/N\ is very close to zero. 
In order to overcome this difficulty, we update the k{t)th element of coefficient vector 
m times. Here m is an integer which becomes large near least squared estimates. Since 
gk{t + At) = (1 — 2At/N) gk{t) as shown in the Example 13. H the /3fc(t + mAt) is 

/3fc(t + mAt) = /3fc(t) + ^ '—At-gk{t). (17) 

a 



The update equation in flTTj) can be applied even when m is a positive real value. 

The following update equation can be used so that the coefficient is appropriately 
updated near the least squared estimates: 



4(t + mAt) = (3kit) + At ■ sign{gk{t)) 



= ^k{t) + -^At-gk{t). 

\9k{t)\ 



Equations (|T71) and (|T8ll give us 



^^log(l-c./|,.(t)|)_ (19) 
log(l - a) 
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Algorithm 1 An iterative algorithm that computes the solution and the degrees of free- 

dom. 

1: t = 0. 

2: while {\gj{t)\ > a} {j = l,...,p) do 

3: Compute {gj(t)} and {Xj{t)} (j = 1, . . . ,p). 

4: S={j\X,it)-m<0} 
5: if 5 = empty then 
6: k = argmax|Aj(t)| 

7: else 

8: k = argmax|Aj(t)| 

9: end if 

10: Compute m = log(l — a/|5'fc(t)|)/log(l — a). 

11: 4(t + mAt) = hit) + At • sign{\k{t)) 

12: 0j{t + m/\t) = Pj{t)}j^k 

13: Compute 

M{t + mAt) =/-{/- atXkit)xl^t)] (I " M{t)). 

14: Compute df (t + mAt) = tr {M(t + mAt)} 
15: t + mAt 
16: end while 



It should be assumed that a < \gk(t)\ for any step so that log(l — a/\gk{t)\) exists. 

When m is given by ( |T9|) . the update equation of the degrees of freedom in ( |T5l) can 
be replaced with 

M{t + mAt) = I - [I -atXkit)xl^t)]{I - M(t)), (20) 

where at = a/\gk{t)\ . The algorithm that computes the solutions and the degrees of 
freedom is given in Algorithm [H 

3.4 More efficient algorithm 

The update equation (!20|) suggests each step costs O(A^) operations to update the co- 
variance matrix M(t). Because the number of iterations denoted by T is usually very 
large such as T = 100000, the proposed algorithm seems to be inefficient when N is large. 
However, a simple modification of the algorithm eases the computational burden. With 
the modified process, each step costs only O(g^) operations, where q is the number of 
selected variables through the generalized path seeking algorithm: p — q variables are not 
selected at all steps for generalized path seeking algorithm. When p is very large, q is 
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smaller than p and early stopping is used. For example, suppose that p = 5000; if we 
do not want more than 200 variables in the final model, we set q = 200 and stop the 
algorithm when 200 variables are selected. 

The modified algorithm is as follows: first, the generalized path seeking algorithm is 
implemented to obtain the entire solutions. The degrees of freedom is not computed, 
whereas the value of gk{t) (t = 1, . . . , T) should be stored in the memory. Then, the QR 
decomposition of N x q matrix X* = (ajj^ ■ ■ ■ Xj^) is implemented, where Xj-^, . . . , Xj^ are 
variables selected by the generalized path seeking algorithm. Note that . . . ,jq} = q. 

The matrix X* can be written as X* = QR, where Q is an x g orthogonal matrix and 
R is a q X q upper triangular matrix. Note that Xk can be written as Qvk, where Vk is 
the g-vector which consists of kth column of R. The update equation of the degrees of 
freedom based on fl20l) is 

tr{M(t + mAt)} = tr {Q^ M{t + m At) Q} 

= q- tr{(J - atrk(t)r^t)){I - at.irk(^t-i)r^t-i)) ■■■(/- airk(i)r^i))}- 

(21) 

Therefore, the computational cost of ( |2T]) is only 0{q^). 

We provide a package msgps (Model Selection criteria via extension of Generalized 
Path Seeking), which computes Mallows' Cp criterion, Akaike's information criterion, 
Bayesian information criterion and generalized cross validation via the deg rees of freedom 



given in Table [H The package is implemented in the R programming system (IR Development Core Teaml . 



20101) , and available from Comprehensive R Archive Network (CRAN) at http : // cran . r-pro j ect . org/ wet 



4 Numerical Examples 
4.1 Monte Carlo simulations 

Monte Carlo simulations were conducted to investigate the effectiveness of our algorithm. 
The predictor vectors were generated from Gaussian distribution with mean vector zero. 
The outcome values y were generated by 

y = 0^x + e, £~Ar(0,a^). 

The following four Examples are presented here. 

1. In Example 1, 200 data sets were generated with = 20 observations and eight 
predictors. The true parameter was (3 = (3, 1.5, 0, 0, 2, 0, 0)-^ and a = 3. The 
pairwise correlation between Xi and xj was cor(i,j 

2. Example 2 was the dense case. The model was same as Example 1, but with 
/3j = 0.85 (j = 1, . . . , 8), and a = 3. 
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3. The third example was same as Example 1, but with f3 = (5, 0, 0, 0, 0, 0, 0, 0)^ and 
cr = 2. In this model, the true (3 is sparse. 



4. In Example 4, a relatively large problem was considered. 200 data sets were gener- 
ated with N = 100 observations and 40 predictors. We set 

/3 = (0^_^,2^_^,0^_^,2_^^ 

10 10 10 10 

and cr = 15. The pairwise correlation between Xi and Xj was cor(z, j) = 0.5 [i ^ j). 

In this simulation study, there are 3 purposes as follows: 

• Degrees of freedom: we investigated whether the proposed procedure can select 
adj usted tuning par ameters compared with the degrees of freedom of the lasso given 



bv IZou et al.l (120071 ) 



Model selection criteria for several penalties: the performance of model selection 
criteria given in Table [T] was com pared for the lasso, elastic net and generalized 



elastic net family (IFriedmanl . l2008l ) 



Speed: the computational time based on (!20!) was compared with that based on 
(EB. 



A detailed description of each is presented. 



Degrees of freedom 



We compared the degrees of freedom computed by our procedure {dfgps, where qys means 
genera lized path seeking) with the degrees of freedom of the lasso proposed by 



ZouetaL 



(120071 ) (df^ott)- The degrees of freedom of the lasso is the number of non-zero coefficients. 
Our method and Zou's et al. (2007) procedure do not yield identical result, since the 
df^o^ is an unbiased estimate of the degrees of freedom while df^p^ is the exact value of 
the degrees of freedom. In this simulation study, the true value of was used to compute 
the model selection criteria. 

Table [2] shows the result of mean squared error (MSE) and the standard deviation 
(SD), which are the mean and standard deviation of the following squared error (SE): 

SE{s) = ^\\fi'^^^-X(3r 

where /t^*-' is the estimate of predicted values for sth dataset. The proportion of cases 
where zero (non-zero) coefficients correctly set to zero (non-zero), say, ZZ (NN), was also 
computed. We can see that 
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Table 2: Mean squared error (MSE), the standard deviation (SD) and the percentage of 
cases where zero (non-zero) coefficie nts correct l y set to zero (non-zero), say, ZZ (NN), for 
our proposed procedure {df gps) and 



Zou et al. 



fl2007f ) (df 



Ex. 1 

dfgps df^ 



Ex. 2 

dfgps df^ 



Ex. 3 

dfgps df^ 



Ex. 4 

dfgps df^ 



MSE 2.498 

SD 1.468 

ZZ 0.592 

NN 0.925 



2.732 
1.726 
0.667 
0.892 



2.761 
1.353 



3.202 
1.727 



0.706 0.649 



0.759 
0.577 
0.607 
1.000 



0.790 
0.647 
0.767 
1.000 



41.35 42.37 

10.67 12.36 

0.586 0.636 

0.689 0.653 



• Our procedure slightly outperformed Zou's et al. (2007) one in terms of minimizing 
the mean squared error for all examples. 

• Zou's et al. (2007) procedure selected zero coefficients correctly than the proposed 
procedure, while our method correctly detected non-zero coefficients compared with 
the Zou's et al. (2007) approach. This means our procedure tends to incorporate 
many more variables than the Zou's et al. (2007) one. 



Model selection criteria for several penalties 

We compared the performance of model selection criteria based on the degrees of freedom: 
Cp criterion (Cp), bias corrected Akaike's information criterion (AICc), Bayesian infor- 
mation criterion (BIG) and generalized cross validation (GCV). Because Cp criterion and 
Akaike's information criterion (AIC) yield the same results when true error variance is 
given, the result of Akaike's information criterion is not presented in this paper. For Cp 
criterion and Bayesian information criterion, we need to estimate the true error variance 
r^. The value of was estimated by the ordinary least squares of most complex model. 
The cross validation, which is one of the most popular methods to select the tuning pa- 
rameter in sparse regression via regularization, was also applied. Since the leave-one-out 
cross validation is computationally expensive, the 10-fold cross validation was used. 

In this simulation study, we also compared the feature and performance of several 
penalties including the lasso, elastic net and generalized elastic net. The elastic net and 
generalized elastic net are given as follows: 

1. Elastic net: 

j=i ^ J 
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Here a is a tuning parameter. Note that a = yields the lasso, and a = 1 produces 
the ridge penalty. 



2. Generalized elastic net: 

p 

P{f3) = J2 log{« + (1 - < a < 1, 



where a is the tuning parameter. iFriedmanI (j2008[ ) showed the generalized elastic 



net approximates the power family penalties -P(/3) = J2 It^jV (0 < 7 < 1), whereas 
the difference occurs at very smal l absolute coeffic ients. The detailed description 



of the generaliz ed elastic net is in iFriedmanI (120081 ). A similar idea of generalized 



elastic net is in 



Candes et al. 



fcoosh . 



Note that the degrees of freedom of the lasso (Zou's et al., 2007) cannot be directly applied 
to the generalized elastic net family. 

We computed mean squared error (MSE), standard deviation (SD), the proportion of 
cases where zero (non-zero) coefficients correctly set to zero (non-zero), say, ZZ (NN). 
Tables [3l m and |5] show the comparison of model selection criteria for the lasso, elastic net 
(a = 0.5) and generalized elastic net (a = 0.5). The detailed discussion of each example 
is as follows: 

• The generalized elastic net yielded the sparsest solution, while the elastic net pro- 
duced the densest one. In Example 2, i.e. the dense case, the elastic net performed 
very well. On the other hand, in sparse case (Example 3), the generalized elastic 
net most often selected zero coefficients correctly. 

• In most cases, bias corrected Akaike's information criterion (AlCc) resulted in good 
performance in terms of mean squared error. The Bayesian information criterion 
(BIG) performed very well in some cases (e.g.. Examples 3 and 4 on the lasso). In 
dense cases (Example 2), however, the performance of Bayesian information criterion 
(BIG) was poor. 

• The performance of cross validation (GV) for generalized elastic net family was 
excellent on Example 3. On Examples 1, 2 and 4, however, the mean squared error 
was large compared with other model selection criteria based on the degrees of 
freedom. The cross validation estimates expected error by separating the training 
data from the test data. Unfortunately, the regularization method with non-convex 
penalty such as generahzed elastic net does not produce unique solution: small 
change in the training data can result in different solution. Thus, the cross validation 
may be unstable for non-convex penalty in many cases. 
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Table 3: Comparison of model selection criteria for the lasso. 









Cp 


AICc 


GCV 


BIC 


CV 


Ex. 


1 


MSE 


2.604 


2.497 


2.614 


2.567 


2.905 






SD 


1.562 


1.463 


1.583 


1.500 


1.722 






zz 


0.519 


0.562 


0.498 


0.622 


0.605 






NN 


0.925 


0.923 


0.935 


0.918 


0.903 


Ex. 


2 


MSE 


2.807 


2.772 


2.781 


2.891 


3.195 






SD 


1.435 


1.399 


1.420 


1.462 


1.712 






ZZ 
















NN 


0.733 


0.729 


0.750 


0.678 


0.686 


Ex. 


3 


MSE 


0.855 


0.790 


0.879 


0.744 


0.986 






SD 


0.666 


0.601 


0.673 


0.594 


0.692 






ZZ 


0.543 


0.573 


0.524 


0.639 


0.716 






NN 


1.000 


1.000 


1.000 


1.000 


1.000 


Ex. 


4 


MSE 


41.66 


42.91 


43.82 


39.22 


43.59 






SD 


10.72 


11.19 


11.74 


9.512 


11.83 






ZZ 


0.577 


0.545 


0.530 


0.671 


0.643 






NN 


0.692 


0.704 


0.713 


0.653 


0.655 



Table 4: Comparison of model selection criteria for the elastic net {a 









Cp 


AlCc 


GCV 


BIG 


CV 


Ex. 


1 


MSE 


2.860 


2.814 


2.826 


2.933 


3.017 






SD 


1.520 


1.487 


1.533 


1.552 


1.556 






ZZ 


0.066 


0.088 


0.061 


0.101 


0.074 






NN 


0.995 


0.993 


0.995 


0.993 


0.997 


Ex. 


2 


MSE 


2.199 


2.061 


2.169 


2.218 


2.301 






SD 


1.434 


1.303 


1.352 


1.489 


1.432 






ZZ 
















NN 


0.973 


0.967 


0.976 


0.958 


0.964 


Ex. 


3 


MSE 


1.566 


1.675 


1.577 


1.714 


1.720 






SD 


0.754 


0.834 


0.755 


0.878 


0.960 






ZZ 


0.014 


0.031 


0.016 


0.036 


0.028 






NN 


1.000 


1.000 


1.000 


1.000 


1.000 


Ex. 


4 


MSE 


25.15 


23.31 


25.55 


24.73 


24.73 






SD 


10.51 


8.656 


11.27 


9.031 


9.431 






ZZ 


0.000 


0.000 


0.000 


0.000 


0.000 






NN 


1.000 


1.000 


1.000 


1.000 


1.000 
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Table 5: Comparison of model selection criteria for the generalized elastic net {a — 0.5). 









Cp 


AICc 


GCV 


BIC 


CV 


Ex. 


1 


MSE 


3.060 


2.996 


3.051 


3.112 


3.785 






SD 


1.825 


1.810 


1.835 


1.821 


2.015 






ZZ 


0.709 


0.776 


0.695 


0.811 


0.792 






NN 


0.798 


0.778 


0.813 


0.752 


0.707 


Ex. 


2 


MSE 


3.757 


3.747 


3.658 


4.018 


4.411 






SD 


1.638 


1.587 


1.619 


1.671 


2.108 






ZZ 
















NN 


0.502 


0.462 


0.517 


0.416 


0.457 


Ex. 


3 


MSE 


0.889 


0.803 


0.949 


0.683 


0.485 






SD 


0.773 


0.722 


0.780 


0.690 


0.641 






ZZ 


0.714 


0.749 


0.679 


0.805 


0.859 






NN 


1.000 


1.000 


1.000 


1.000 


1.000 


Ex. 


4 


MSE 


74.68 


74.12 


75.15 


78.75 


80.56 






SD 


18.22 


17.97 


18.90 


16.19 


22.41 






ZZ 


0.796 


0.796 


0.766 


0.912 


0.815 






NN 


0.379 


0.382 


0.408 


0.262 


0.352 
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Table 6: Computational time (seconds) based on update equation ( 12U]) (naive) and ( 1?T]) 
(modified) averaged over 200 runs for the lasso. 







Ex. 1 


Ex. 2 


Ex. 3 


Ex. 4 






naive modified 


naive modified 


naive modified 


naive modified 


= 


100 


1.399 0.145 


1.434 0.164 


1.398 0.143 


1.521 0.343 


= 


200 


4.827 0.190 


4.834 0.246 


4.820 0.188 


5.012 0.375 


= 


500 


69.92 0.313 


69.96 0.351 


69.89 0.310 


70.38 0.459 



Speed 

The computational time based on update equation ( I20i) (naive update) was compared 
with that based on fl2T|) (modified update). In order to compare the timings for various 
number of samples, we changed the number of samples for all Examples: A^ = 100, 
200 and 500. Table [6] shows the result of timings averaged over 200 runs for lasso penalty. 
All timings were carried out on an Intel Core 2 Duo 2.0 GH processor on Mac OS X. 
Note that the "timing" means the computational time of producing the solutions and 
computing the model selection criteria. The speed based on naive update was very slow 
when A^ = 500, because we need 0(500^) operations to compute the degree of freedom for 
each step. However, the modified algorithm was fast even when the number of samples 
was large. 
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Table 7: The estimated standardized coefficients for the diabetes data based on the lasso, 
elastic net (a = 0.5) and generalized elastic net (a = 0.5). 





(Intercept) 


age 


sex 


bmi 


map 


tc 


Idl 


hdl 


tch 


Itg 


glu 


lasso 


152 





-209 


522 


303 


-120 





-224 


12 


518 


58 


enet 


152 


-2 


-220 


504 


309 


-93 


-81 


-188 


122 


460 


87 


genet 


152 





-228 


532 


326 





-70 


-288 





489 






4.2 Application to diabetes data 



The proposed algorithm was applied to diabetes data ( lEfron et al.l . |2004| ). which has 
N = 442 and p = 10. Ten baseline predictors include age, sex, body mass index (bmi), 
average blood pressure (bp), and six blood serum measurements (tc, Idl, hdl, tch, Itg, glu). 
The response is a quantitative measure of disease progression one year after baseline. We 
considered the following three penalties: the lasso, elastic net (a = 0.5), and generalized 
elastic net (a = 0.5). The entire solution path along with the solution selected by Cp 
criterion, and the degrees of freedom are presented i n Figure [H On the lasso penalty, 
the degrees of freedom of the lasso (IZou et al.l . 120071 ) is als o depicted . The degrees of 
freedom of our procedure was smaller than that of the lasso f Zou et all . 2007) except for 
||/3(t)|| G [2838,2851], where the degrees of freedom of the lasso decreased because the 
non-zero coefficient of 7th variable became zero at t = 2838. 

On the elastic net penalty with a = 0.5, the degrees of freedom increased rapidly at 
some points. For example, when ||/3(t)|| = 343.76, the degrees of freedom increased by 
about 0.864. Figure [2] shows the solution path of 2nd variable and g2{t), which is helpful 
for understanding why the degrees of freedom rapidly increased. When ||/3(t)|| attained 
343.76, the sign of g2{t) changed. At this point, (32{t) > 0. Thus, g2{t)fi2{t) < at 
||/3(t)|| = 343.76, which means ^2{t) was updated because the set 5* in line 4 in Algorithm 
[1] became 5'={2}. When \g{t)\ was sufficiently small, m in (|T9|) became very large, which 
made a substantial change in the degrees of freedom. 

The estimated standardized coefficients for the diabetes data based on the lasso, elastic 
net (a = 0.5) and generalized elastic net (a = 0.5) are reported in Table [71 The tuning 
parameter was selected by Cp criterion, where the degrees of freedom was computed via 
the proposed procedure. The generalized elastic net yielded the sparsest solution. On the 
other hand, the elastic net did not produce sparse solution. 
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Figure 1: The solution path (left panel) and the degrees of freedom (right panel). The 
vertical line in the solution path indicates the selected model by Cp crite r ion. The solid 
line of upper right panel is the degrees of freedom of the lasso (IZou et al.l . 120071 ) 
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Figure 2: The solution path (32it) (left panel) and (72 (i) (right panel) of the 2nd variable 
for elastic net with a = 0.5. 
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5 Concluding Remarks 



We have proposed a new procedure for selecting tuning parameters in sparse regres- 
sion modeling via regularization, for which the degrees of freedom was calculated by a 
computationally-efficient algorithm. Our procedure can be applied to construct model se- 
lection criteria for evaluating models estimated by the regularization methods with a wide 
variety of convex and non-convex penalties. Monte Carlo simulations were conducted to 
investigate the effectiveness of the proposed procedure. Although the cross validation has 
been widely used to select tuning parameters, the model selection criteria based on the 
degrees of freedom often yielded better results, especially for non-convex penalties such 
as generalized elastic net. 

In the present paper, we considered a computationally-efficient algorithm to select the 
tuning parameter in the sparse regression model. For more general models including gen- 
eralized linear models, multivariate analysis such as factor analysis and graphical models, 
it is also important to select appropriate values of tuning parameters. As a future research 
topic, it is interesting to introduce a new selection algorithm that handles large models 
by unifying the mathematical approach and computational algorithms. 

Appendix 

Proof of Lemma 12.11 

First, we divide AlP i^ito three terms as follows: 

ll2/"™-Af = ||y^^"-/xf + ||/x-/xf + 2(?/°--/x)V-/i). (Al) 
The term ||^ — is expressed as 

11/^ -Af = l|y - Af + l|y - - 2(?/ - - A) 

= 111/- Af - lly-^f + 2(?/-Ai)^(/i-M) 
= \\y - - \\y - /xf + 2{y - ^ififi - Eylfi]) 

+2{y-tif{Ey[fi]-fi). (A2) 

Substituting (|A2l) into (|Al]) gives us 

11?/°-- /if = ||^-«-/xf + ||^-/if + 2(y-/xf(/i-i5;^[/i]) 

+2(y - fif{Ey[fi] - /x) + 2(y-" - tififi - fi). 
By taking expectation of Eyncw, we obtain 

Eyn..[\\y-^- - fif] = Ey...[\\y^^- - tif] + \\y - fif ~ \\y - 
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+2(2/ - fj,y (A - Eylfi]) + 2(y - fiY (Eylfi] - 

Equation ([6]) can be derived by taking the expectation of Ey and using Eyncv[\\y^^"" — 
t^f] = Ey[\\y - t^f] = nT\ 

Derivation of generalized path seeking algorithm 

We derive the generahzed path seeking algorithm. First, the following lemma is provided. 
Lemma 5.1. Let us consider the following problem. 
A/3(t) = argmin[i?(/3(t) + A/3) - s.t. P0{t) + A/3) - P(/3(t)) < At. (A3) 

A/3 

The solution is A/3(t) = /3(t + At) - /3(t). 
Proof. The constraint in flA3p is written as 

P{${t) + A/3) < P(/3(t)) + At < t + At. 

Assume that (3* = /3(t) + A/3. Then, the problem (|A3l) is 

$* = argmin [i?(/3*)] s.t. P(/3*) < t + At. 

13* 

The solution is /3* = /3(t + At), which leads to A/3(t) = /3(t + At) - /3(t). □ 
When At is sufficiently small, the problem ( lASj) can be approximately written as 



A/3(t) = argmax y^Qjit) ■ A^j (A4) 

{A/3,b=l,...,p} ^.^-^ 

S.t. ■ l^/^^l + J2 ■ ■ ^l^j < ^t- (A5) 

/3,(t)=0 /3,(t)5^0 

Since all coefficient paths {/3j(t) | j = l,...,j9} are monotone functions of t, we have 
{sign0j{t)) = sign{Af3j{t)) \ j = l,...,p}. Therefore, the problem in (lASp can be 
expressed as 

p p 
A/3(t) = argmax 2.9j 

(t) ■ A(3j s.t. > Pj(t) • \ A/3j\ < At. (A6) 

{Al3,\j=l,...,p} .^^ .^^ 



The problem in ( ]A6p can be viewed as a linear programming. Then, the updates in ([8]) 
and (El) can be derived. 
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